Abstract. Erasure coding techniques are being integrated into networked distributed
storage systems as a way to provide fault tolerance at the cost of less
storage overhead than traditional replication. Redundancy is maintained over time 
through repair mechanisms, which may entail large network resource overheads. 
In recent years, several novel codes tailor-made for distributed storage have been 
proposed to optimize storage overhead and repair, such as Regenerating Codes
that minimize the per-repair traffic, or Self-Repairing Codes which minimize the
number of nodes contacted per repair. Existing studies of these coding techniques 
are however predominantly theoretical, under the simplifying assumption that 
only one object is stored. They ignore many practical issues that real systems 
must address, such as data placement, de/correlation of multiple stored objects, 
or the competition for limited network resources when multiple objects are re- 
paired simultaneously. This paper empirically studies the repair performance of 
these novel storage centric codes with respect to classical erasure codes by sim- 
ulating realistic scenarios and exploring the interplay of code parameters, failure 
characteristics and data placement with respect to the trade-offs of bandwidth 
usage and speed of repairs.

1 Introduction

Nowadays large data storage architectures such as Google FS [9] or Amazon S3 [2]
are built upon networked distributed storage systems that spread data over several
commodity storage servers. To ensure that data survives disk failures, these storage
systems must keep the original data together with some amount of redundancy. 3-way
replication has traditionally been used to that effect. However, today's systems such
as Microsoft Azure [6], Hadoop FS [3] or the new Google FS are increasingly using
redundancy schemes based on erasure codes due to their capability to reduce storage
overheads [7][8][24]. For example, in deployed systems erasure codes can reliably store
data with an overhead of 1.3x-1.5x the size of the original data [5][6].

One of the main problems in networked distributed storage systems is to replenish 
redundancy over time as storage nodes fail. With replication and classical erasure codes
(e.g. Reed-Solomon codes), repairing a missing piece of data entails the same communication
cost as transmitting a whole data object [22], which causes a massive utilization
of network resources. Furthermore, a single node is responsible for the repair of a given 
object, which might create network bottlenecks, slowing down the repair process and 
compromising data reliability. This triggered a line of research addressing the design 
of novel codes tailor-made for distributed storage which facilitate the repair process: 
for example, Regenerating Codes (RGC) ll5l [TTl[T9l are designed to minimize the repair 
traffic, while Self-Repairing Codes (SRC) fi31 aim at reducing the number of nodes 
contacted per repair. However, these novel code families have so far only been studied 
theoretically, in a simplistic setting where only one object is stored. Storing multiple
objects instead forces one to take into account the concurrent repair of multiple failures
for the same object [4], which in turn might be affected by the chosen data placement
strategy and the consequent contention for limited and shared network resources.
This paper is a step toward exploring the question: 'How do these novel codes, whose
corresponding theories predict significant improvements on repairing objects in isolation,
actually work under realistic settings?'

Encoding an object to be stored over a network using an erasure code consists of 
splitting the object into k fragments, that are transformed into n > k redundant frag- 
ments, stored in distinct storage nodes in the system. The transformation is such that the 
original object can be reconstructed from a subset of these redundant fragments. When 
a storage node fails, it can be repaired by downloading some amount of data from a
d-subset of live nodes (d ≤ n - 1). RGC and SRC differ in how this d-subset is selected.
RGC allow any d-subset of the n - 1 live nodes to be selected, where d is a relatively
large value (the larger the d, the lower the repair traffic in RGC, and hence d = n - 1
is typically preferred), while SRC require much smaller values of d, usually as low as
d = 2. In that sense, these two approaches, while addressing the same problem of better
repairability in erasure coded storage systems, represent two extreme design points in 
terms of the repair degree d, that is the number of nodes to be contacted per repair. Both 
choices present their own pros and cons: 

- In terms of choices of d-subsets, RGC are more flexible in that a lost fragment
can be repaired from any d-subset of live nodes. In contrast, SRC require the use
of specific d-subsets of nodes, although there are many such possible subsets to
choose from.

- The latency of the repair process can be severely affected if a single involved node 
is overloaded. Consequently, a large repair degree d makes RGC more vulnerable 
to overloaded nodes. Given the many possible choices of small subsets with which 
a repair can be carried out, SRC avoid bottlenecks as long as they can locate one 
such subset with unloaded live nodes. 

Besides that, when multiple nodes fail simultaneously, both coding schemes can be 
used to repair each failure independently from the others. The original design of SRC 
naturally preserves the good repairability properties for multiple failures, but the re- 
pair performance of RGC degrades in this situation because live nodes are more prone 
to provide data to several repair processes. However, an RGC variant called collaborative
regenerating codes (CRGC) [11][19] improves upon what can be achieved
by allowing several repair processes to each contact a (possibly different) d-subset before
exchanging data among each other.

Besides the problems related to coding schemes, in real distributed storage systems
nodes can be overloaded due to unbalanced object assignments [23] or bad task
scheduling ETI. In particular, we are interested in those cases where some network
endpoints at each node might become overloaded, and we identify two main situations
where this can happen:

- Nodes in real systems store data from multiple objects. When one of these nodes
fails, an independent repair process is triggered to repair each of the involved objects.
These simultaneous repair processes might overload some of the live nodes,
creating network bottlenecks for repair processes.

- Some data-intensive applications running in datacenters might need to send or receive
large amounts of data through the network (e.g., Map-Reduce tasks), causing
temporarily overloaded nodes.

When some of the nodes are overloaded for either of these two reasons, the limited
number of repair choices of SRC can make these codes more prone to slow
repair times. RGC, in contrast, can download data from the first d live nodes with less
network load, reducing the repair latency. This leads to the following central question:
'Given some node fault patterns and network load constraints, which code can carry out
the repairs at what rate, and what are the implications of the corresponding repair times
on the system's resilience?' The complete answers to these questions are not obvious
even for a single stored object, and studying a single object is in any case not adequate,
since realistic environments require the storage of multiple objects, leading to further
complexity [4].

This paper empirically evaluates the repair performance of RGC and SRC in a sys- 
tem storing and maintaining multiple data objects. These two new code families were 
chosen since they represent the two possible extremes for optimizing the repair process: 
(i) RGC aim at the minimal absolute repair communication (recall that the larger the value
of the repair degree d, the lower the repair traffic), while (ii) SRC minimize the number of
live nodes needed to carry out a repair, in fact achieving d = 2. The low repair degree
in turn leads to not only significant reduction of repair traffic, but also other benefits 
such as fast and parallel repairs. Additionally, we also contrast the results of these two 
novel codes with traditional erasure codes. Our analysis focuses on the required com- 
munication and repair speed, while varying the code parameters, failure characteristics, 
data placement and network load. 

Our study leads to several intuitive, as well as not so intuitive results. We confirm 
that both RGC and SRC reduce the maintenance communication overhead significantly 
with respect to traditional erasure codes. Regarding data placement, it appears that the 
repair process is significantly slowed down when data is placed in a clustered manner. 
A more interesting result concerns repair speed: while RGC mostly consume less bandwidth
than SRC, a pipelined variation of SRC achieves significantly faster repairs than all
other codes under all the settings that we have studied. We also observe that, since bandwidth
is an ephemeral resource, it is more important to utilize it in a balanced manner
over time, and the very low value of d in SRC facilitates this, allowing fast and
parallel repairs even when multiple objects as well as multiple faults are considered.
Finally, regarding the performance of repairs on overloaded networks, we find that
repairs complete significantly faster when the same overall load is caused by short and
frequent "busy" node periods than by long and sporadic ones.

The rest of the paper is organized as follows. In Section 2 we first present some
background on erasure codes and related works proposing different constructions or
analyses of codes for distributed storage systems. In Section 3 we provide essential
information on the codes we study, followed by a description of our simulation methodology,
including how we model the various properties of the codes as well as the networked
storage environment. We present our findings in Section 4. Finally, in Section
5 we conclude by discussing the practical implications of the results.

2 Background and Related Work 

Erasure codes have been widely studied in the storage literature as a mechanism to
provide high data reliability and reduce the storage overhead required with respect to 
simple data replication. Given a data object of size B, an (n, k) erasure code splits this 
object into k smaller fragments each of size B/k. These k fragments are then mapped 
to a set of n redundant fragments, n > k, to be stored into n different nodes. If the code 
is a maximum distance separable (MDS) code, the stored object can be reconstructed 
from any k-subset of redundant fragments. For example, 3-way replication is a (3,1)
MDS erasure code that maps the object to three copies of itself. 
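As a small illustration of these definitions, the following Python sketch computes the fragment size B/k and the storage overhead n/k of an (n, k) code, and states the MDS reconstruction condition. The function names are ours and purely illustrative; they do not come from any library or from the codes studied here.

```python
from fractions import Fraction


def fragment_size(object_size: int, k: int) -> Fraction:
    """Size of each of the k fragments of an object of size B (i.e. B/k)."""
    return Fraction(object_size, k)


def storage_overhead(n: int, k: int) -> Fraction:
    """Stored data divided by original data for an (n, k) code (i.e. n/k)."""
    if not 0 < k <= n:
        raise ValueError("an (n, k) code needs 0 < k <= n")
    return Fraction(n, k)


def mds_can_reconstruct(live_fragments: int, k: int) -> bool:
    """An MDS code can rebuild the object from any k of its n fragments."""
    return live_fragments >= k


if __name__ == "__main__":
    # 3-way replication seen as a (3, 1) MDS code: overhead 3, one copy suffices.
    print(storage_overhead(3, 1), mds_can_reconstruct(1, 1))
```

Exact fractions are used so that overheads such as 7/3 are not rounded.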

One of the main uses of erasure codes is to protect against multiple disk failures in RAID-6
disk configurations lUBl . where the n redundant fragments are distributed into small sets 
of disks -usually from four to eight disks. Since the value of n is small, the code can be 
implemented as a flat-XOR code: all redundant fragments are obtained by xoring some
of the k original fragments [18]. Furthermore, it was shown in [12] that these flat-XOR
codes can repair redundant fragments without needing to reconstruct the entire original 
object, by reading only d live fragments, where d < k. 
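To make the idea of repairing with d < k concrete, here is a minimal sketch of a flat-XOR code. The layout is hypothetical and chosen only for illustration (k = 4 data fragments, two parities, each covering two data fragments); it is not one of the codes evaluated in this paper. A lost fragment is rebuilt by xoring the d = 2 other members of its parity group.

```python
# Hypothetical layout: parity fragment index -> data fragments it covers.
PARITY_GROUPS = {4: (0, 1), 5: (2, 3)}


def xor_bytes(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(data: list[bytes]) -> list[bytes]:
    """Return the k data fragments followed by the XOR parities."""
    parities = [xor_bytes(*(data[i] for i in members))
                for members in PARITY_GROUPS.values()]
    return list(data) + parities


def repair(stored: dict[int, bytes], lost: int) -> bytes:
    """Rebuild fragment `lost` from the other members of its parity group."""
    for parity, members in PARITY_GROUPS.items():
        group = (parity, *members)
        if lost in group:
            return xor_bytes(*(stored[i] for i in group if i != lost))
    raise KeyError(lost)


if __name__ == "__main__":
    fragments = encode([b"ab", b"cd", b"ef", b"gh"])
    stored = {i: f for i, f in enumerate(fragments) if i != 0}  # fragment 0 lost
    print(repair(stored, 0))  # reads only fragments 1 and 4
```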

Erasure codes are also used in networked distributed storage systems to reduce the
storage overhead down to 1.3x the size of the original object, n/k ≈ 1.3 [3][6]. In
these systems the n redundant fragments are stored across different commodity storage
servers, which are more prone to disk failures, power outages, network disconnections
and software errors. Achieving a low storage overhead in these environments while
guaranteeing a high data availability requires spreading data over larger sets of storage
nodes. Since designing optimal flat-XOR codes for large n values and finding efficient
repairs (d < k) are two NP-hard problems [10], existing solutions use traditional
erasure codes like Reed-Solomon codes, i.e., codes designed for communication over
noisy channels. They have already been extensively studied in the context of networked
storage (e.g. [13]), and require more complex finite-field operations. Their main drawback
is that repairing a missing fragment entails contacting d = k storage nodes, and
the same communication cost as transmitting a whole data object [22].

Recently, some novel erasure codes have been designed to reduce this repair cost. 
Regenerating Codes (RGC) and Collaborative RGC (CRGC) HQED can be seen 
as a composition of an MDS erasure code and a network code [1], aiming at the minimal
communication to repair one failure at a time for RGC, and several node failures
in parallel for CRGC by enabling collaboration among repairing nodes. Compared
to traditional erasure codes like Reed-Solomon, RGC and CRGC reduce the
repair communication at the cost of increasing the number of contacted nodes, d > k.
Some variants of RGC like Minimum Bandwidth RGC further reduce the
repair communication at the additional cost of increasing the size of the redundant
fragment (larger than B/k), making the codes less suitable for environments where
minimizing the storage footprint is a priority. Furthermore, the recent Simple Regenerating
Codes ifTTl allow to easily combine t = 2 classical MDS erasure codes to reduce the 
number of nodes contacted during repairs to d = 4. Unfortunately, this approach increases
the overall storage overhead by 50% compared to RGC or CRGC. In general
the overhead can be reduced to 100 t^{-1} % by increasing the number of contacted nodes
to d = 2t, but at the cost of losing the potential to repair redundant fragments efficiently [16].

Finally, Self-Repairing Codes (SRC) [15] are a family of non-MDS erasure codes
which, like RGC and CRGC, aim at minimizing the repair traffic and storage overhead,
though this is attained by drastically reducing the number of live nodes contacted for a
repair. To be specific, in SRC only d = 2 live nodes need to be contacted for a repair,
allowing the repair of up to (n - 1)/2 simultaneous faults. In this paper we will compare
the repair performance of these two families of coding techniques (RGC/CRGC and
SRC) to classical Erasure Codes (EC).



3 Theoretical and Experimental Settings 

In this section we present the main theoretical features of the codes which are the
subject of this study and we describe the simulator framework that we use
in Section 4 to evaluate the repair performance of these codes.



3.1 Novel Coding Techniques 

By an erasure code (EC), we mean a map that encodes k fragments into n, with the 
property that any choice of k encoded fragments is enough to recover the encoded 
object -i.e., they have the MDS property. Each node stores an amount of data per object 
equal to B/k, which is the minimal amount possible to guarantee that objects can be 
reconstructed by retrieving k encoded fragments out of the total n. In order to repair one 
failure, a repair process, also called newcomer, downloads k encoded fragments, from 
which it recovers the object, and can thus recompute the missing encoded fragment. 
This is costly both in terms of downloaded data and computation, though it becomes
interesting in the case of lazy repair, where the system waits for f (f < n - k) failures
to accumulate before triggering the repairs. Such a repair procedure, where one node
downloads k fragments, reconstructs the f missing encoded fragments and then distributes
f - 1 of them to other nodes, has an
average communication cost per failure (for one object and normalized by B/k) of

    \gamma_{EC} = \frac{k + f - 1}{f}.    (1)
Regenerating Codes (RGC) can be seen as erasure codes in that object reconstruction
is similarly done by contacting any choice of k nodes (they have the MDS
property). Repair is however done differently as we now explain by considering Col- 
laborative RGC (CRGC) HUGH, which include RGC as a special case. CRGC allow 
newcomers to collaboratively repair f failed fragments. In CRGC, repair is done in two
steps: a download phase where each of the f repair processes downloads data from d live
fragments, d ≥ k, and a collaborative phase, where all the f nodes exchange data among
themselves. When f = 1, CRGC are exactly RGC; however, in the event of multiple
failures, it is more favorable to use CRGC than RGC, which can only repair one failure
at a time sequentially. The total amount of normalized repair traffic per failure of one
object, when each node stores an amount of data B/k, is

    \gamma_{CRGC} = \frac{d + f - 1}{d - k + f},    (2)

which was derived analytically assuming a single stored object ifTTI . We note that when 
k = d, γ_CRGC = γ_EC; thus EC can be considered a special case of CRGC.



Self-Repairing Codes (SRC) minimize the number of live nodes to be contacted
for repair. In fact, SRC enable the repair of a single failure by contacting only 2 nodes,
while f failures can be repaired by contacting only 2 nodes per repair for up to
f < (n - 1)/2 failures. More failures can be tolerated, but without the guarantee that contacting
only 2 nodes will work. SRC thus cannot have the MDS property: take one fragment
that can be obtained from two others; an object cannot be retrieved from k fragments
that include these 3 fragments. Still assuming B/k amount of data per object per node,
the normalized repair communication per failure is

    \gamma_{SRC} = \frac{2f}{f} = 2.    (3)

It is important to note that γ_EC, γ_CRGC and γ_SRC are theoretical bounds subject to
the existence of explicit code constructions satisfying them. In this paper we focus on 
the potential repair performance of the different codes independently of the currently 
known code constructions. 
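The three normalized repair costs of this section can be compared numerically with a short sketch. It assumes the expressions γ_EC = (k + f - 1)/f, γ_CRGC = (d + f - 1)/(d - k + f) and γ_SRC = 2f/f = 2, with f the number of failures repaired together; the function names are illustrative only.

```python
from fractions import Fraction


def gamma_ec(k: int, f: int = 1) -> Fraction:
    """Lazy repair with a classical erasure code: download k, redistribute f - 1."""
    return Fraction(k + f - 1, f)


def gamma_crgc(k: int, d: int, f: int = 1) -> Fraction:
    """Collaborative regenerating codes; f = 1 gives plain RGC."""
    if d < k:
        raise ValueError("the repair degree d must be at least k")
    return Fraction(d + f - 1, d - k + f)


def gamma_src(f: int = 1) -> Fraction:
    """Self-repairing codes: two fragments downloaded per repaired fragment."""
    return Fraction(2 * f, f)


if __name__ == "__main__":
    k = 4
    for d in range(k, 7):
        # For f = 1, RGC cost d / (d - k + 1) drops to the SRC cost at d = 2k - 2.
        print(d, gamma_crgc(k, d), gamma_src())
```

With d = k the CRGC expression coincides with the EC one, and for f = 1 the RGC cost exceeds the SRC cost exactly when d < 2k - 2.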



3.2 Pipelined Repairs 

Besides considering the regular repair procedure of RGC/CRGC and SRC, we introduce
pipelined repairs for SRC (SRCp), adapted from [14], where the technique was originally
conceived for RGC with heterogeneous link capacities. Concretely, the repair process for SRCp does not
download data directly from 2 nodes, but asks the first node to download data from 
the second. Then this first node encodes the received data byte-by-byte with the data it 
stores and forwards it to the node running the repair process. Thus, the repair can finish
within the time to transmit only one fragment plus a small overhead, negligible for large 
objects, due to the time required to encode and transmit the first fragment bytes. We do 
not consider pipelined RGC since they use additional storage at auxiliary 'apprentice' 
nodes, hence are not strictly comparable, and their repair time is lower bounded by that 
of SRCp, though at a significantly higher implementation complexity. 
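The pipelined repair path described above can be sketched with Python generators. The sketch assumes, purely for illustration, that the per-byte encoding step is an XOR of the relay's own byte with the received byte; the actual encoding depends on the code construction. The point is that the relay forwards each byte as soon as it is encoded, so no node has to buffer a full fragment.

```python
from typing import Iterable, Iterator


def stream_fragment(fragment: bytes) -> Iterator[int]:
    """Second node: sends its stored fragment byte by byte."""
    yield from fragment


def relay_and_encode(own_fragment: bytes, incoming: Iterable[int]) -> Iterator[int]:
    """First node: combines each received byte with its own and forwards it."""
    for own_byte, received in zip(own_fragment, incoming):
        yield own_byte ^ received  # assumed encoding step (XOR)


def pipelined_repair(first_fragment: bytes, second_fragment: bytes) -> bytes:
    """Newcomer: collects the encoded stream coming from the first node."""
    return bytes(relay_and_encode(first_fragment, stream_fragment(second_fragment)))


if __name__ == "__main__":
    a, b = b"\x0f\xf0", b"\xff\x00"
    print(pipelined_repair(a, b).hex())
```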
3.3 Simulator Set Up

To evaluate the behavior of different codes, we use a discrete time simulator where,
at each time step, all nodes can send and receive (full-duplex network with symmetric
bandwidth) a data packet of a fixed length. Outgoing packets from each node are
queued until the destination node is available to receive them. Once a node receives data
from the outgoing queue of some other node, it refuses further requests during
the same time round. Which request is accepted, and which others are declined, is decided
randomly without bias. Rejected nodes try to find any other appropriate node
to send their outgoing packets to, if that is possible. This simulation model allows us to
emulate network congestion on those nodes with more incoming/outgoing traffic. For the
sake of simplicity but without loss of generality, we set the packet length to β, which
is the minimum amount of data that nodes transmit during repairs. We set the discrete
step duration τ to τ = β/(ω · eff), where ω is the upload/download bandwidth, set
to ω = 1 Gbps, and eff, eff ∈ (0, 1], is the network efficiency, set to eff = 0.8 (as a default
value). This eff parameter allows us to emulate network overheads such as packet
headers or retransmissions.
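As a rough sketch of two ingredients of this model, the snippet below computes the step duration τ = β/(ω · eff) and resolves one round of contention by letting every destination accept a single sender uniformly at random. The packet size in the example (8e9 bits, i.e. 1 GB) and all names are assumptions made for illustration, not values or code from the simulator.

```python
import random


def step_duration(beta_bits: float, omega_bps: float = 1e9, eff: float = 0.8) -> float:
    """Duration in seconds of one discrete step: tau = beta / (omega * eff)."""
    if not 0 < eff <= 1:
        raise ValueError("eff must lie in (0, 1]")
    return beta_bits / (omega_bps * eff)


def accept_requests(requests: dict[int, list[int]], rng: random.Random) -> dict[int, int]:
    """For each destination node, accept one requesting sender at random.

    `requests` maps a destination node id to the senders that want to
    transmit to it in this round; all other senders are rejected.
    """
    return {dest: rng.choice(senders) for dest, senders in requests.items() if senders}


if __name__ == "__main__":
    print(step_duration(8e9))  # hypothetical 1 GB packet over 1 Gbps at eff = 0.8
    print(accept_requests({7: [1, 2, 3], 9: [4]}, random.Random(0)))
```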

Besides the network load caused by the repair traffic we also aim to simulate load 
due to real data-intensive processes like Map-Reduce tasks. To emulate these data- 
intensive processes we assume that nodes can have overloaded periods where they 
cannot send or receive repair data. Repair processes needing data from these over- 
loaded nodes will have to wait or find other suitable nodes, which might lengthen repair 
times and compromise data reliability. We will assume that overloaded periods at each 
node have Poisson arrivals and that the durations of these periods are exponentially 
distributed. If λ_a and λ_d represent respectively the arrival rate and the duration rate
of these overloaded periods, then the average number of overloaded nodes at any time is
N · λ_a/λ_d, where N is the total number of nodes in the system. Although this model
does not capture all peculiarities of a real system running data-intensive applications, it
gives us a simple way to analyze how different congested scenarios can affect the
repair performance of different codes.
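The load model can be sketched as follows, assuming exponential inter-arrival times with rate λ_a and exponential durations with rate λ_d (so that the mean duration is 1/λ_d). The numeric values in the example are hypothetical and only chosen so that N · λ_a/λ_d gives 10% of 1,000 nodes; overlapping periods on the same node are not merged in this sketch.

```python
import random


def expected_overloaded(num_nodes: int, arrival_rate: float, duration_rate: float) -> float:
    """Average number of overloaded nodes: N * lambda_a / lambda_d."""
    return num_nodes * arrival_rate / duration_rate


def sample_periods(arrival_rate: float, duration_rate: float,
                   horizon: float, rng: random.Random) -> list[tuple[float, float]]:
    """Sample (start, end) overloaded periods of one node up to time `horizon`."""
    periods = []
    start = rng.expovariate(arrival_rate)
    while start < horizon:
        periods.append((start, start + rng.expovariate(duration_rate)))
        start += rng.expovariate(arrival_rate)
    return periods


if __name__ == "__main__":
    # Hypothetical parameters: mean duration of 25 steps, arrival rate 0.004 per step.
    print(expected_overloaded(1000, 0.004, 1 / 25))
    print(sample_periods(0.004, 1 / 25, 2000.0, random.Random(1))[:3])
```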

Finally, in our simulated environment the different repair processes are executed as
follows. For RGC/CRGC, the repair process downloads d fragments from the first d
nodes out of the live n - f that have a free uploading slot (f is the number of failures at
a given time). For SRC and SRCp, it has a list with all the possible pairs of nodes available
to repair each lost fragment. In the case of SRCp, it uses the first pair of nodes that
are simultaneously available to upload data. This repair then takes τ seconds, plus the
time (number of discrete time intervals) it had to wait before a suitable pair of nodes
became simultaneously available. For SRC the repair process needs two nodes to upload
their fragments in two different time steps. Due to the limited pairs available for each repair,
the repair pair selection can have a significant impact on the repair time. Analyzing
different SRC repair schedules is out of the scope of this paper and we choose a simple
strategy: we randomly select the pair of nodes used for each repair. In Section 4.3 we
will show that this simplistic policy can have detrimental effects when there are correlated
failures. More sophisticated scheduling mechanisms for SRC would likely yield
improved repair performance. In that sense, the results provide a pessimistic baseline of how
SRC based repairs perform, leaving room for further improvements.
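The source-selection rules of this paragraph can be summarized in a few lines of Python. This is only a sketch under our own naming and data-structure assumptions (node ids as integers, a set of nodes with a free upload slot in the current step); it is not the simulator's code.

```python
import random
from typing import Optional, Sequence


def pick_rgc_sources(live_nodes: Sequence[int], free_upload: set[int],
                     d: int) -> Optional[list[int]]:
    """RGC/CRGC: the first d live nodes that have a free upload slot."""
    chosen = [node for node in live_nodes if node in free_upload][:d]
    return chosen if len(chosen) == d else None  # None: wait for a later step


def pick_srcp_pair(repair_pairs: Sequence[tuple[int, int]],
                   free_upload: set[int]) -> Optional[tuple[int, int]]:
    """SRCp: the first repair pair whose two nodes are free at the same time."""
    for a, b in repair_pairs:
        if a in free_upload and b in free_upload:
            return (a, b)
    return None


def pick_src_pair(repair_pairs: Sequence[tuple[int, int]],
                  rng: random.Random) -> tuple[int, int]:
    """SRC: a repair pair chosen uniformly at random, regardless of load."""
    return rng.choice(list(repair_pairs))


if __name__ == "__main__":
    free = {2, 3, 4}
    print(pick_rgc_sources([1, 2, 3, 4], free, d=2))
    print(pick_srcp_pair([(1, 2), (3, 4)], free))
```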
Fig. 1: Probability to retrieve a stored object for different number of live nodes and for
different code parameters.



4 Evaluation 

In the evaluation of the different codes we distinguish three different scenarios: 

1. Only one node fails (f = 1) and nodes do not have overloaded periods.

2. Only one node fails (f = 1) and nodes have overloaded periods.

3. There are correlated node failures and multiple encoded blocks for the same object 
may be missing. 

In scenarios 1 and 2 a single fragment is lost for every object stored on the failed node.
For each scenario we also evaluate three different (n, k) code parameters, namely (7,4),
(7,3) and (15,5), respectively achieving storage overheads of 1.75, 2.33 and 3. In
Figure 1 we depict the static resilience, i.e., the probability of being able to recover the stored
object in the presence of node failures, for the three different code parameters. These
results are obtained by enumerating all possible node failure combinations. We can
see that the static resilience of the non-MDS SRC is comparable to that of CRGC.
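The enumeration behind a static-resilience curve can be sketched for a toy XOR-based code. The generator set below is hypothetical: a (7,3) code whose seven fragments are all non-empty XOR combinations of three original pieces, each encoded as a 3-bit mask. It is used only to show the procedure and should not be read as the exact construction evaluated in this paper. An object is recoverable from a set of live fragments when their masks have rank k over GF(2).

```python
from itertools import combinations
from typing import Iterable


def gf2_rank(vectors: Iterable[int]) -> int:
    """Rank over GF(2) of bitmask vectors, via an incremental XOR basis."""
    basis: list[int] = []
    for vector in vectors:
        for element in basis:
            vector = min(vector, vector ^ element)  # clear element's top bit
        if vector:
            basis.append(vector)
    return len(basis)


def static_resilience(generators: list[int], k: int, live: int) -> float:
    """Fraction of `live`-sized sets of surviving fragments that recover the object."""
    subsets = list(combinations(generators, live))
    recoverable = sum(1 for subset in subsets if gf2_rank(subset) == k)
    return recoverable / len(subsets)


if __name__ == "__main__":
    generators = list(range(1, 8))  # hypothetical (7, 3) XOR code
    for live in range(1, 8):
        print(live, static_resilience(generators, k=3, live=live))
```

For an MDS code the same curve is a step function: 0 below k live nodes and 1 from k onwards.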

4.1 Single Node Failure Evaluation 

To analyze the effects of a single node failure we randomly select one storage node and 
delete all its stored data. Then we start repair processes to regenerate all the missing 
fragments, measuring the repair times and traffic consumed by each of them. Our plot- 
ted results are the average results from 1,000 independent experiments. The single node 
failure analysis is done for Regenerating Codes (RGC), Self-Repairing Codes (SRC) as 
well as for its pipelined version (SRCp). Recall that (i) all nodes store the same amount 
of data B/k per object, (ii) that for one single failure, CRGC reduces to RGC, and (iii) 
that erasure codes (EC) can be considered as a special case of CRGC with d = k.

Fig. 2: Analysis of one single node failure: performance of different codes is shown as
a function of the number L of stored objects. The overall amount of data stored across
N = 1,000 nodes in the system is B · L = 50TB.

Evaluating Data Granularity: To measure the impact of different data granularities
or data partition sizes on the system performance, we assume a system storing a total
amount of B · L = 50TB, where B is the object size and L the number of stored objects.
These 50TB correspond to the size of the stored data, without redundancy. Storing this
amount of data using an (n, k) code requires an aggregated node capacity of n · B · L/k.
We evaluate different data granularities by running simulations for different values of
L, where the n redundant fragments are randomly stored into N = 1,000 nodes.

In Figure 2 we depict the results for the granularity experiment using a random
data placement. In Figures 2b, 2d and 2f we show the average overall traffic required
to repair a failed node as a function of the granularity, i.e., the number of objects L.
Despite some small variations due to the random fragment placement and the averaging
of 1,000 experiments, we see how the experimental results fit the analytical predictions
defined in (1), (2) and (3): from (2), γ_CRGC = d/(d - k + 1) for f = 1, thus increasing the
repair degree d in RGC reduces the overall network traffic. SRC/SRCp achieve lower repair
traffic than RGC only when d < 2k - 2, since from (3) γ_SRC = 2. We can also
appreciate how the (15,5) code, which has the largest storage overhead, requires more
traffic per failure, since more data needs to be repaired per failure.

Figures 2a, 2c and 2e show in logarithmic scale the average fragment repair times
for a single node failure. As we can see, SRCp exactly halves the repair times of SRC.
This implies that for single failures, SRC achieve a good repair performance by using
any random pair of fragments. It is also interesting to see how even for those RGC
configurations that require less repair traffic than SRCp, the repair time for SRCp is
significantly shorter than for RGC. Finally, note that from L = 500 to L = 20,000, B
is reduced by a factor of 40, which is roughly the same improvement that we measure
on repair times when we switch from L = 500 to L = 20,000. We thus conclude that
data granularity has no significant impact on the performance of the analyzed codes.
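The arithmetic behind the granularity experiment is easy to check. The sketch below assumes decimal units (1 TB = 10^12 bytes), which the text does not specify, and uses illustrative function names.

```python
TOTAL_BYTES = 50 * 10**12  # B * L = 50 TB of data before redundancy (decimal TB assumed)


def object_size(total_bytes: int, num_objects: int) -> float:
    """Object size B when the same total volume is split into L objects."""
    return total_bytes / num_objects


def aggregated_capacity(total_bytes: int, n: int, k: int) -> float:
    """Raw capacity needed to store the data with an (n, k) code: n * B * L / k."""
    return total_bytes * n / k


if __name__ == "__main__":
    coarse = object_size(TOTAL_BYTES, 500)
    fine = object_size(TOTAL_BYTES, 20_000)
    print(coarse, fine, coarse / fine)
    for n, k in [(7, 4), (7, 3), (15, 5)]:
        print((n, k), aggregated_capacity(TOTAL_BYTES, n, k))
```

Going from L = 500 to L = 20,000 shrinks each object, and therefore each fragment to be repaired, by the factor of 40 mentioned above.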



Evaluating Data Placement Strategies: Networked distributed storage systems also 
need to deal with data placement: 'How to assign redundant fragments to nodes to 
maximize system performance and data reliability?' To measure the impact of different 
data placement strategies, we use the concept of data clustering [20]. We divide the full
set of storage nodes into disjoint subsets of nodes called clusters. Each of these clusters 
is an independent storage system with N nodes, which stores all the fragments for a 
given object. We assume that fragments within the cluster are randomly distributed. For 
the smallest value of N, N = n, the data placement becomes a full clustered placement 
where all nodes in the cluster store fragments of the same set of objects. 
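A clustered placement of this kind can be sketched as follows. The round-robin assignment of objects to clusters is our own assumption for the example (the text only states that all fragments of an object stay within one cluster and are randomly distributed inside it), and the function name is illustrative.

```python
import random


def place_object(object_id: int, total_nodes: int, cluster_size: int,
                 n: int, rng: random.Random) -> list[int]:
    """Return the ids of the n distinct nodes storing one object's fragments.

    Nodes are split into disjoint clusters of `cluster_size` consecutive ids;
    the object is mapped to one cluster (round-robin, an assumption) and its
    fragments are placed on random nodes of that cluster.
    """
    if n > cluster_size:
        raise ValueError("a cluster must hold at least n nodes")
    num_clusters = total_nodes // cluster_size
    first_node = (object_id % num_clusters) * cluster_size
    return rng.sample(range(first_node, first_node + cluster_size), n)


if __name__ == "__main__":
    rng = random.Random(42)
    print(place_object(0, total_nodes=1000, cluster_size=7, n=7, rng=rng))    # full clustering
    print(place_object(0, total_nodes=1000, cluster_size=1000, n=7, rng=rng))  # fully random
```

With cluster_size equal to n every node of the cluster stores a fragment of every object assigned to it, which is the fully clustered placement described above.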

In Figure 3 we depict the results for different placement strategies. In Figures 3b,
3d and 3f we observe no differences in the average overall traffic per failed node for
different placements. However, in Figures 3a, 3c and 3e we see how the average fragment
repair time increases exponentially with data clustering, since the same network
resources are needed for all the repairs. Also, for any given degree of clustering, SRCp
consistently and significantly outperforms all other codes.

Fig. 3: Analysis of one single node failure: performance of different codes is shown for
different data clusterings. The size of the objects is B = 1GB, and we store an amount
of objects proportional to the cluster size, L = 10 · N.

Fig. 4: CDF of the amount of data each node uploads in order to repair a single node
failure. Results are obtained for a configuration with L = 10,000, N = 1,000 and
B = 1GB.



Bandwidth Usage Characterization: To conclude the evaluation of single failures, Figure 4
illustrates the CDF for the amount of data each node uploads per failure. Steeper
curves represent better load balancing among nodes. For RGC we can see how large 
d values achieve better traffic balancing among nodes. In contrast, despite using only 
2 nodes per repair, SRC/SRCp achieve a good network traffic balancing, always better 
than the RGC (7,4) code, and better than RGC when d < 4 and d < 7 respectively for 
the (7,3) and (15,5) codes. More balanced usage of bandwidth across all nodes trans- 
lates to fewer contentions for the same resources, hence faster repairs. 



4.2 Repair Performance in Loaded Networks 

We consider four different scenarios with different average numbers of temporarily
overloaded nodes, namely 10%, 20%, 50% and 80% of nodes. For each of these scenarios
we consider two different ways of achieving this percentage of overloaded nodes: (i)
one where nodes have short overloaded periods but become overloaded at a high rate,
and (ii) a second one where nodes have long overloaded periods, but become overloaded
at a lower rate. Table 1 lists the different parameters used in each case.

[Table 1 body only partially recovered. Surviving row labels: one ending in "each step."
and "Avg. duration of the "overloaded" state." Surviving values, in extraction order:
4, 25, 4, 50, 4, 125, 4, 200.]

Table 1: Different parameter values to simulate different percentage of overloaded
nodes when there are N = 1000 nodes in the system.

In Figure 5 we show the average repair time for the different codes under different
percentages of overloaded nodes. Figures 5a, 5c and 5e show the average repair times
for nodes with long overloaded periods. In general, for all code parameters, repair
times become longer as more nodes are simultaneously overloaded. Interestingly, unlike
in non-overloaded networks (Figures 2 and 3), RGC needs longer repair times as the
repair degree d increases. Although large values of d reduce the repair traffic, when
some of the nodes are overloaded it becomes more difficult to contact d nodes, and the
repair process may have to wait. Similarly, Figures 5b, 5d and 5f depict the average
repair times when nodes have short overloaded periods. In this case, increasing the
repair degree d does not have the detrimental effect observed in the previous set of
figures, and repair times are one order of magnitude shorter than for long overloaded
periods.
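To see why a large repair degree d is penalized when nodes are overloaded, consider the chance that a repair can start immediately. The sketch below assumes that each of the n - 1 surviving fragment holders is overloaded independently with the same probability, which is a simplification of the simulated scenarios. The function name `prob_enough_helpers` is hypothetical.

```python
from math import comb

def prob_enough_helpers(n, d, overload_frac):
    """Probability that at least d of the n-1 surviving fragment holders
    are not overloaded, assuming independent overload with equal probability."""
    m = n - 1                      # candidate helper nodes
    q = 1.0 - overload_frac        # probability that a helper is available
    return sum(comb(m, i) * q**i * (1.0 - q)**(m - i) for i in range(d, m + 1))

if __name__ == "__main__":
    # (15,5) code: compare a small and a large repair degree.
    for frac in (0.1, 0.2, 0.5, 0.8):
        low = prob_enough_helpers(15, 5, frac)
        high = prob_enough_helpers(15, 14, frac)
        print(f"overloaded={frac:.0%}  d=5: {low:.4f}  d=14: {high:.4f}")
```

The probability drops quickly as d approaches n - 1, so a repair with large d is more likely to wait for helpers to become available, even though it would transfer less data.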

Regarding the network traffic required to repair lost fragments in loaded systems
(Figure 6), we observe no noticeable differences. For both short and long overloaded
periods, the traffic required to repair a failed node is the same as in non-overloaded
systems.

4.3 Multiple Node Failures 

We now evaluate the repair performance when a fraction of nodes fail simultaneously,
so that more than one fragment per object may be lost. We thus study Collaborative
Regenerating Codes (CRGC), which perform better than regular RGC in terms of both
bandwidth utilization and repair latency. For CRGC we set the repair parameter t to the
number of failed fragments, and to minimize repair traffic we maximize the repair
degree, d = n - t. We adapt CRGC's parameters dynamically to the number of failed
fragments, so that the evaluation shows the best one could possibly achieve with CRGC.
Table 2 reports, for both codes, the percentage of lost objects as a function of the
fraction of failed nodes. The number of lost objects increases either when we decrease
the redundancy (n/k) or when we increase the failure probability. We also note that the
values in this table depend on the total number N of nodes, and that they complement the
single-object analysis presented earlier.

Fig. 5: Average repair times for a single node failure when a fraction of the nodes
are temporarily overloaded and cannot receive/send data. The size of the objects is
B = 1GB and the number of nodes is N = 1000.

Fig. 6: Average traffic per failure for a single node failure when a fraction of the
nodes are temporarily overloaded and cannot receive/send data. The size of the objects
is B = 1GB and the number of nodes is N = 1000.

  (15,5)-CRGC   (15,5)-SRC
  0.00%
  0.00%         0.00%
  0.00%         0.01%
  0.06%         0.22%
  0.89%         1.92%
  5.79%         9.09%

Table 2: Percentage of unrecoverable objects (lost objects) for CRGC and SRC codes in
a system with N = 1,000 nodes. The values are expressed as a function of the fraction
of failed nodes.
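The dependence of object loss on redundancy and failure probability can be sanity-checked analytically for codes in which any k of the n fragments suffice to rebuild the object (this holds for RGC/CRGC, but not for SRC). The sketch below is an approximation and not the simulation behind Table 2. It assumes one fragment per node and independent node failures, and the function name `mds_loss_probability` is hypothetical.

```python
from math import comb

def mds_loss_probability(n, k, fail_frac):
    """Probability that an (n,k) object is unrecoverable, i.e. that more than
    n-k of its n fragments fail, assuming independent failures (assumption)."""
    return sum(comb(n, j) * fail_frac**j * (1.0 - fail_frac)**(n - j)
               for j in range(n - k + 1, n + 1))

if __name__ == "__main__":
    for n, k in [(7, 4), (7, 3), (15, 5)]:
        row = ", ".join(f"{mds_loss_probability(n, k, f):.2%}"
                        for f in (0.1, 0.2, 0.3, 0.4, 0.5))
        print(f"({n},{k}): {row}")
```

For a (15,5) code with half of the nodes failing, this expression gives 1941/32768, or about 5.9%. Lowering the redundancy n/k or raising the failure probability increases the loss, in line with the trend described above.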

Figure 7 depicts the results of the multiple failure experiments. Figures 7b, 7d and
7f display the average repair traffic per repaired fragment. In the (7,3) configuration,
the SRC/SRCp repair traffic increases with the fraction of failed nodes. This happens
because massive node failures prevent some repair processes from finding any suitable
pair of fragments, forcing them to reconstruct the whole object. The same happens with
RGC when fewer than d fragments survive the failures. The repair process then acts as a
classical EC lazy repair scheme: one of the repair processes reconstructs the original
object and sends new fragments to the t - 1 others, which, as depicted in Figures 7b and
7d, can reduce the traffic for some RGC cases. Finally, note that the classical EC lazy
repair technique, RGC (d = k), does not achieve further traffic savings.
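The traffic saving of the lazy fallback can be illustrated with simple arithmetic. The sketch below assumes fragments of size B/k, as in MDS erasure codes, and counts only the data moved over the network. One repairing node downloads k fragments (B in total), rebuilds the object, and ships one fragment of size B/k to each of the other t - 1 repairing nodes. The function names are hypothetical.

```python
def eager_traffic_per_fragment(B, k):
    """Classical EC repair of one fragment: download k fragments of size B/k."""
    return k * (B / k)

def lazy_traffic_per_fragment(B, k, t):
    """Lazy repair of t fragments at once, amortized per repaired fragment:
    one download of the whole object plus t-1 fragment transfers of size B/k."""
    return (B + (t - 1) * B / k) / t

if __name__ == "__main__":
    B = 1.0  # object size in GB
    for k in (3, 4, 5):
        costs = [lazy_traffic_per_fragment(B, k, t) for t in (1, 2, 3, 4)]
        print(f"k={k}: eager={eager_traffic_per_fragment(B, k):.2f} GB, "
              "lazy=" + ", ".join(f"{c:.2f}" for c in costs) + " GB")
```

With t = 1 the lazy cost equals the eager cost B, and it decreases towards B/k as more fragments are repaired together, which is why falling back to whole-object reconstruction can lower the per-fragment traffic.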

Figures 7a, 7c and 7e depict the average fragment repair time for different failure
probabilities. The first thing we notice is that in this scenario SRC takes more than
twice the repair time required by SRCp. As noted in Section 4.1, this difference is
caused by the random selection of the pair of fragments used to repair. As the network
becomes more saturated (larger failure probabilities), the gap between SRC and SRCp
increases. Furthermore, it may be baffling at first to observe that for the low
redundancy configuration (7,4), repair times decrease when the failure probability is
0.5. This is in fact due to the high number of lost fragments that cannot be repaired at
all at that failure probability (see Table 2), which reduces network utilization and
hence the repair times of the objects that remain repairable. Note that for such a large
failure probability neither RGC (for large d) nor CRGC can repair all objects with the
standard repair procedure; they instead have to rely on reconstructing the entire object
and regenerating the missing fragments from it.

Finally, similar to Figure 4, Figure 8 illustrates the CDF of the amount of data each
node uploads to repair a fraction of failed nodes. We focus only on the two
configurations that achieve the best performance in terms of network traffic and upload
times: CRGC and SRCp. Compared to the single failure case, the traffic to repair
multiple failures is not evenly distributed among all nodes. This is because the failed
fraction of nodes has no data at all to upload. Besides that, for the (7,3) and (15,5)
codes, the flexibility of CRGC to use any d-subset of live nodes in the repair process
makes the overall repair traffic more evenly distributed than for SRCp (steeper CDF
curves). Evenly distributed repair traffic avoids network congestion and explains why
CRGC, despite needing to contact more nodes than SRCp, can in some cases, i.e., for the
(15,5) code, repair faster than SRCp.

Fig. 7: Analysis of the system performance using different codes when a fraction of
nodes fails simultaneously. The size of the objects is B = 1GB, and L = 10,000 objects
are randomly stored in N = 1,000 nodes.

Fig. 8: CDF of the amount of data each node uploads in order to repair a fraction of
failed nodes. Results are obtained for a configuration with L = 10,000, N = 1,000
and B = 1GB.
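The kind of per-node upload CDF discussed above can be computed from a list of upload totals as in the following sketch. The helper names and the two toy workloads are hypothetical and are not data from the experiments. They only illustrate how a steeper CDF corresponds to a more even spread of repair traffic.

```python
def empirical_cdf(values):
    """Return sorted (x, F(x)) pairs, F(x) being the fraction of nodes
    that upload at most x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def idle_fraction(values):
    """Fraction of nodes that upload nothing (e.g. because they failed)."""
    return sum(1 for v in values if v == 0) / len(values)

def coefficient_of_variation(values):
    """Std. deviation divided by mean: lower means more evenly spread traffic."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var ** 0.5 / mean

if __name__ == "__main__":
    # Toy per-node upload totals in GB (made-up numbers for illustration).
    even = [0, 0, 4, 4, 4, 4, 4, 4]
    skewed = [0, 0, 1, 1, 2, 4, 6, 10]
    for name, uploads in (("even", even), ("skewed", skewed)):
        print(name, f"idle={idle_fraction(uploads):.2f}",
              f"cv={coefficient_of_variation(uploads):.2f}")
        print("  cdf:", empirical_cdf(uploads))
```

Both toy workloads carry the same total traffic, but the skewed one concentrates it on a few nodes, which become the bottleneck for the repairs they serve.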



5 Conclusions 

In this paper we empirically studied and compared, in realistic settings, the repair
performance of novel codes tailor-made for distributed storage: RGC/CRGC and SRC/SRCp.
We found that for single node failures, in most scenarios, RGC achieves the largest
reduction in repair communication (in fact, the more live nodes d > k are contacted, the
larger the reduction), while SRC/SRCp achieve a slightly smaller reduction but still
perform significantly better than traditional erasure codes. We also analyzed the
effects of data granularity and data placement, showing that although granularity has no
impact on repair performance, highly clustered placements can significantly increase
repair times, compromising data reliability. When multiple nodes fail simultaneously,
CRGC again achieves a better traffic reduction than SRC/SRCp for certain numbers of
failures, but if the number of failures is very high, our results confirm that the
traditional erasure code approach of reconstructing the whole object to restore the lost
fragments is the most efficient in terms of communication costs, as previously predicted
in [13].

In terms of repair time, our newly introduced SRCp clearly outperforms all other codes
in all studied scenarios. In fact, in overloaded environments the performance gain is
even larger in the absence of any well-crafted scheduling mechanism, i.e., when using
random suitable nodes. We would also like to note that many of the results for RGC, and
all the results for CRGC, assume the existence of codes that support arbitrary and
dynamic choices of d and t, and hence indicate the best that may be achievable with such
codes, if and when they are designed. Thus, all things considered, we conclude that the
pipelined SRC codes have the most practical benefit: they are simple and have a
bandwidth consumption similar to SRC (which is typically much lower than that of
traditional erasure codes, but slightly higher than that of RGC/CRGC), while providing a
significant repair speedup in diverse environments.